An implementation of the d distance function for DNA sequences: The wcd d EST clustering algorithm
نویسنده
چکیده
This report gives a skeleton description of the d2 algorithm used for the clustering of expressed sequence tags (ESTs) in the wcd program. It describes how the algorithm works and why some design decisions were made. No experimental evidence is reported here. This is subject of ongoing research.
منابع مشابه
An overview of the wcd EST clustering tool
UNLABELLED The wcd system is an open source tool for clustering expressed sequence tags (EST) and other DNA and RNA sequences. wcd allows efficient all-versus-all comparison of ESTs using either the d(2) distance function or edit distance, improving existing implementations of d(2). It supports merging, refinement and reclustering of clusters. It is 'drop in' compatible with the StackPack clust...
متن کاملAn efficient implementation of the d2 distance function for EST clustering: preliminary investigations
The d2 distance function is commonly used in the clustering of DNA sequences such as expressed sequence tags (ESTs), an important biological application. The use of d2 allows approximate string matching to be performed with a good balance between selectivity and sensitivity. The computational challenges of EST clustering make the efficient evaluation of the d2 function an imperative. The paper ...
متن کاملAlgorithms for clustering expressed sequence tags: the wcd tool
Understanding which genes are active, and when and why, is an important question for molecular biology. Expressed Sequence Tags (ESTs) are a technology used to explore the transcriptome (a record of this gene activity). ESTs are short fragments of DNA created in the laboratory from mRNA extracted from a cell. The key computational step in their processing is clustering : putting all ESTs associ...
متن کاملA Hybrid Time Series Clustering Method Based on Fuzzy C-Means Algorithm: An Agreement Based Clustering Approach
In recent years, the advancement of information gathering technologies such as GPS and GSM networks have led to huge complex datasets such as time series and trajectories. As a result it is essential to use appropriate methods to analyze the produced large raw datasets. Extracting useful information from large data sets has always been one of the most important challenges in different sciences,...
متن کاملA Fuzzy C-means Algorithm for Clustering Fuzzy Data and Its Application in Clustering Incomplete Data
The fuzzy c-means clustering algorithm is a useful tool for clustering; but it is convenient only for crisp complete data. In this article, an enhancement of the algorithm is proposed which is suitable for clustering trapezoidal fuzzy data. A linear ranking function is used to define a distance for trapezoidal fuzzy data. Then, as an application, a method based on the proposed algorithm is pres...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2003